Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you can use R Markdown syntax.
I hope to learn some useful techniques with R, and to be able to analyze data on my own after the semester ends.
Describe the work you have done this week and summarize your learning.
This is a dataset with 60 variables. By analyzing it, we hope to understand which variables are related to exam points.
step 1: Data cleaning. To analyze the data, the first step is to clean it (scale the “Attitude” column) and select the information we are interested in. Since there are many variables (183 observations and 60 variables), which would make the analysis unwieldy, I combine related question items into three broad categories: deep, surface, and strategic. I then average the values of the deep_columns, surface_columns, and strategic_columns items. Finally, I keep only the rows where points is greater than zero. These are the steps of the “data cleaning” phase.
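A minimal sketch of this wrangling, assuming the raw data frame is called `lrn14`; the item-name vectors below are placeholders, not the actual questionnaire column names:

```r
library(dplyr)

# placeholder item names for the three question groups (assumptions)
deep_columns      <- c("D03", "D11", "D19", "D27")
surface_columns   <- c("SU02", "SU10", "SU18", "SU26")
strategic_columns <- c("ST01", "ST09", "ST17", "ST25")

students2014 <- lrn14 %>%
  mutate(
    attitude = Attitude / 10,                        # scale the Attitude sum
    deep = rowMeans(across(all_of(deep_columns))),   # average each group
    surf = rowMeans(across(all_of(surface_columns))),
    stra = rowMeans(across(all_of(strategic_columns)))
  ) %>%
  select(gender, age, attitude, deep, stra, surf, points) %>%
  filter(points > 0)                                 # keep rows with points > 0
```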
step 2: Show a graphical overview of the data and show summaries of the variables in the data.
## Loading required package: ggplot2
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
After the data cleaning step, the data has 166 observations and 7 variables, and I start drawing some plots.
The plots show the distribution of each variable and the relationships between pairs of variables, also split by gender. I found a positive correlation between attitude and points (0.43), and a negative correlation between deep and surf. From the box plots, the values of age and the deep questions are more condensed; however, age has many outliers.
From the summary, we see the minimum, maximum, mean, and quartile values of each variable (age, attitude, deep, and so on).
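The overview and summaries above can be reproduced roughly like this (the GGally package is assumed to be available):

```r
library(GGally)
library(ggplot2)

# pairwise plots and distributions, coloured by gender
ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3))

# min, max, mean and quartiles of every variable
summary(students2014)
```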
step 3: Choose attitude, deep question, and strategic question variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable.
##
## Call:
## lm(formula = points ~ attitude + deep + stra, data = students2014)
##
## Coefficients:
## (Intercept) attitude deep stra
## 11.3915 3.5254 -0.7492 0.9621
##
## Call:
## lm(formula = points ~ attitude + deep + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## deep -0.7492 0.7507 -0.998 0.31974
## stra 0.9621 0.5367 1.793 0.07489 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
I choose points as the Y (dependent) variable and attitude, deep, and stra as the X (explanatory) variables to fit a multiple regression. According to the summary, the model’s p-value is 2.521e-08, which is smaller than 0.05, so the model as a whole is statistically significant. However, the residual standard error (5.289) is quite large, so the predictions are not very precise. The multiple R-squared and adjusted R-squared are both low (0.2097 and 0.195), meaning the model explains only about 20% of the variation in exam points; still, we shouldn’t dismiss the model on that basis alone, since many factors should be taken into account and R-squared is not the only criterion for judging a regression model.
step 4: Produce Residuals vs Fitted plot, Normal QQ-plot and Residuals vs Leverage plot.
1. Residuals vs Fitted plot: A “good” residuals vs. fitted plot should have no obvious outliers and be roughly symmetrically distributed around the 0 line, without particularly large residuals. Our plot shows no clear pattern between the fitted values and the residuals, so the linearity and constant-variance assumptions seem reasonable for this model.
2. Normal QQ-plot: In theory, if both sets of quantiles come from the same distribution, the points should form a roughly straight line along the 45-degree reference line. In our plot the points fall approximately on the reference line, which means the residuals are approximately normally distributed.
3. Residuals vs Leverage plot: This plot helps identify data points with large influence on the model. Points outside the red dashed Cook’s distance lines would be influential; removing them would likely noticeably alter the regression results.
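The three diagnostic plots can be produced directly from the fitted model; in base R’s `plot.lm`, `which = c(1, 2, 5)` selects Residuals vs Fitted, Normal Q-Q, and Residuals vs Leverage:

```r
my_model <- lm(points ~ attitude + deep + stra, data = students2014)

par(mfrow = c(2, 2))               # arrange the plots in a grid
plot(my_model, which = c(1, 2, 5)) # the three diagnostic plots
```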
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
This is a dataset about students’ alcohol consumption. There are 382 observations and 35 variables, including sex, age, family size, alcohol consumption, parents’ education and jobs, and so on. Through the analysis, I want to study the relationships between high/low alcohol consumption and some of the other variables in the data.
I assume that “studytime” (weekly study time), “failures” (number of past class failures), “goout” (frequency of going out with friends), and “freetime” (free time after school) are important variables with a strong relationship to alcohol consumption. My hypothesis is that students who study less per week, fail more classes, go out with friends more often, and have more free time after school consume more alcohol.
1.Numerically and graphically explore the distributions
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_study_time
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 157 2.34
## 2 F TRUE 41 2
## 3 M FALSE 113 1.88
## 4 M TRUE 71 1.62
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_failures
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 157 0.204
## 2 F TRUE 41 0.439
## 3 M FALSE 113 0.239
## 4 M TRUE 71 0.479
According to the summary statistics of study time grouped by sex and high_use, females with shorter study time tend to consume more alcohol (more than twice a week), and those who study longer tend to consume less (at most twice a week). The same holds for males. The result supports my hypothesis.
The summary statistics of failures grouped by sex and high_use show that females who failed more classes in the past tend to consume more alcohol, and those with fewer failures tend to consume less. The same holds for males. This also supports my hypothesis.
The boxplots show that, for the variable “goout”, females who consume more alcohol go out more, and the same holds for males, which matches my hypothesis. For the variable “freetime”, females who consume more alcohol have more free time after school; for males, however, the difference is small. The result is similar to my hypothesis but does not match it exactly.
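The grouped summaries and boxplots above can be sketched as follows (`alc` is the wrangled data frame):

```r
library(dplyr)
library(ggplot2)

# mean study time by sex and consumption group
alc %>%
  group_by(sex, high_use) %>%
  summarise(count = n(), mean_study_time = mean(studytime))

# distribution of "goout" by consumption group and sex
ggplot(alc, aes(x = high_use, y = goout, col = sex)) +
  geom_boxplot() +
  ylab("going out with friends")
```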
2.Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable.
I choose “studytime”, “failures”, “goout”, and “freetime” as the four X variables and fit a model where Y is “high_use” (high/low alcohol consumption). Earlier I also looked at the data separately for males and females, to dig deeper into it. Below is what I found from the model.
##
## Call:
## glm(formula = high_use ~ studytime + failures + goout + freetime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8214 -0.7528 -0.5442 0.8552 2.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.36957 0.62399 -3.797 0.000146 ***
## studytime -0.57481 0.16784 -3.425 0.000615 ***
## failures 0.19303 0.16899 1.142 0.253334
## goout 0.70490 0.12039 5.855 4.77e-09 ***
## freetime 0.07209 0.13531 0.533 0.594163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 395.17 on 377 degrees of freedom
## AIC: 405.17
##
## Number of Fisher Scoring iterations: 4
## (Intercept) studytime failures goout freetime
## -2.36956938 -0.57481413 0.19303395 0.70489610 0.07209276
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.09352099 0.0267779 0.3109179
## studytime 0.56280947 0.4007339 0.7752293
## failures 1.21292398 0.8699068 1.6929038
## goout 2.02363642 1.6081702 2.5811264
## freetime 1.07475503 0.8240334 1.4026604
The summary shows that the standard errors for “studytime”, “failures”, “goout”, and “freetime” are 0.168, 0.169, 0.120, and 0.135, which are comparatively small, so the coefficients are estimated fairly precisely. The coefficient estimates are -0.575, 0.193, 0.705, and 0.072 respectively; judging by the p-values, “studytime” (negatively) and “goout” (positively) are clearly associated with Y (“high_use”), while “failures” and “freetime” are not statistically significant at the 0.05 level.
The odds ratio of “studytime” is 0.563 (less than 1), while “failures” is 1.213, “goout” is 2.024, and “freetime” is 1.075 (all greater than 1). This means that “failures”, “goout”, and “freetime” are positively associated with “high_use”, and “studytime” is negatively associated. According to the confidence intervals, the interval for “goout” (1.61–2.58) lies entirely above 1 and the interval for “studytime” (0.40–0.78) entirely below 1, so these two associations are reliable; the intervals for “failures” (0.87–1.69) and “freetime” (0.82–1.40) include 1, so those associations are uncertain.
Based on the above results, “goout” in particular is strongly related to high/low alcohol consumption, and “failures” and “freetime” show a weaker positive tendency. Since the odds ratio of “studytime” is less than one, its association is negative rather than positive; for the prediction step below I focus on “failures”, “goout”, and “freetime”.
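The odds ratios and their profile-likelihood confidence intervals come from exponentiating the model coefficients; a sketch:

```r
m <- glm(high_use ~ studytime + failures + goout + freetime,
         data = alc, family = "binomial")

OR <- exp(coef(m))        # odds ratios
CI <- exp(confint(m))     # prints "Waiting for profiling to be done..."
cbind(OR, CI)
```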
3. Using the variables that have a statistical relationship with high/low alcohol consumption, explore the predictive power of your model.
## prediction
## high_use FALSE TRUE
## FALSE 248 22
## TRUE 76 36
Since I decided to drop “studytime”, I refit the model without it and use it for prediction. According to the confusion matrix, the precision is 248/(248+76) = 0.77 and the recall is 248/(248+22) = 0.92 (treating low consumption as the positive class). The model thus has high recall but lower precision: most of the positive examples are correctly recognized, but there are many false positives. The average proportion of wrong predictions on the training data is 0.2565, and in cross-validation it is 0.2487, so the model’s error rate is fairly high (around 25%).
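A sketch of the prediction and cross-validation step; `cv.glm()` from the boot package performs 10-fold cross-validation, with the proportion of wrong predictions as the loss function:

```r
library(dplyr)
library(boot)

m2 <- glm(high_use ~ failures + goout + freetime,
          data = alc, family = "binomial")

# classify with a 0.5 probability threshold and cross-tabulate
alc <- mutate(alc,
              probability = predict(m2, type = "response"),
              prediction  = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)

# proportion of wrong predictions: on training data, then under 10-fold CV
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
loss_func(alc$high_use, alc$probability)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
cv$delta[1]
```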
1.Explore the structure and the dimensions of the data
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506 14
The Boston data (housing values in suburbs of Boston) has 506 observations and 14 variables. The variables include “crim” (per capita crime rate by town), “zn” (proportion of residential land zoned for lots over 25,000 sq.ft), and “rm” (average number of rooms per dwelling), among others. By analyzing the data, we hope to understand which features affect housing prices.
2.Show a graphical overview of the data and show summaries of the variables in the data.
## corrplot 0.84 loaded
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The correlation matrix shows the relationship between each pair of variables. If a dot is blue, the two variables are positively correlated; if it is red, they are negatively correlated. The darker the color (blue or red), the stronger the correlation; the dot is white or almost white if the correlation is weak or absent. For example, “rad” and “tax” have a high positive correlation, while “lstat” and “medv”, and “age” and “dis”, have negative correlations. From the summary I see the minimum, maximum, mean, and quartile values of each variable; for example, the average number of rooms per dwelling is 6.285, and the average per capita crime rate is 3.61.
3.Standardize the dataset and print out summaries of the scaled data.
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
After scaling, the mean of each variable is 0 and each variable has unit variance, so the variables are on a comparable scale.
4.Create a categorical variable of the crime rate in the Boston dataset, and drop the old crime rate variable from the dataset.
## crime
## low med_low med_high high
## 127 126 126 127
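Steps 3 and 4 can be sketched together; `cut()` with the quantiles as break points yields four roughly equal categories:

```r
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))   # standardize all variables

# categorical crime rate: quantiles as break points
bins  <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

boston_scaled <- dplyr::select(boston_scaled, -crim)  # drop the old variable
boston_scaled$crime <- crime
table(boston_scaled$crime)
```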
5. Divide the dataset into train and test sets, so that 80% of the data belongs to the train set, and fit linear discriminant analysis on the train set.
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2549505 0.2500000 0.2400990
##
## Group means:
## zn indus chas nox rm
## low 0.9172982 -0.8791733 -0.081207697 -0.8600290 0.5015735
## med_low -0.1248659 -0.3220393 -0.004759149 -0.5623348 -0.1286361
## med_high -0.3581843 0.1684602 0.195445218 0.4343774 0.1316144
## high -0.4872402 1.0149946 0.011791568 1.0632950 -0.4575021
## age dis rad tax ptratio
## low -0.8813803 0.8301270 -0.7053471 -0.7605381 -0.48217523
## med_low -0.3081105 0.3387474 -0.5458997 -0.4705499 -0.05076457
## med_high 0.4237496 -0.3794713 -0.4155974 -0.3161717 -0.40340755
## high 0.8210526 -0.8680522 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.37215747 -0.77975886 0.60342015
## med_low 0.32464046 -0.12448928 0.00625033
## med_high 0.09255862 0.02319678 0.22971777
## high -0.70475064 0.83894493 -0.65191096
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.138147603 0.60596054 -1.00809030
## indus -0.015403765 -0.23770783 0.05753724
## chas -0.051622705 0.01937468 0.13664845
## nox 0.241882493 -0.98643384 -1.26407958
## rm -0.109956056 -0.09876600 -0.16149914
## age 0.356291976 -0.34186011 -0.06535453
## dis -0.109935819 -0.34586588 0.09341477
## rad 3.478763696 0.87430205 -0.30998827
## tax -0.006822419 0.10700558 0.77730241
## ptratio 0.121113638 0.03940028 -0.21760293
## black -0.160509219 0.03305790 0.13390020
## lstat 0.145193143 -0.22247044 0.40482985
## medv 0.139507560 -0.41137725 -0.12478686
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9523 0.0351 0.0127
6.Save the crime categories from the test set and then remove the categorical crime variable from the test dataset, then predict the classes with the LDA model on the test data.
## predicted
## correct low med_low med_high high
## low 16 8 0 0
## med_low 4 15 4 0
## med_high 0 12 12 1
## high 0 0 1 29
The cross tabulation shows how the predictions relate to the correct classes. For example, 16 observations with true class low are correctly predicted as low, while 8 observations with true class low are predicted as med_low.
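The split, fit, and prediction can be sketched as follows (continuing from the scaled data with the `crime` factor):

```r
set.seed(123)                                   # the split is random
n   <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)                # 80% of rows for training
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]

correct <- test$crime                           # save the true classes
test    <- dplyr::select(test, -crime)          # and remove them from the test set

lda.fit  <- MASS::lda(crime ~ ., data = train)  # fit LDA on the train set
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct, predicted = lda.pred$class)
```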
7.Reload the Boston dataset and standardize the dataset, and calculate the distances between the observations.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
I calculate the distances with the Euclidean and the Manhattan distance measures. According to the results, the Manhattan distances are larger than the Euclidean ones (e.g. the median Euclidean distance is 4.8241, while the median Manhattan distance is 12.609).
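Both distance matrices come from `dist()` with different `method` arguments:

```r
boston_std <- scale(MASS::Boston)               # standardize first

dist_eu  <- dist(boston_std, method = "euclidean")
dist_man <- dist(boston_std, method = "manhattan")
summary(dist_eu)
summary(dist_man)
```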
8. Run the k-means algorithm on the dataset. Investigate what the optimal number of clusters is and run the algorithm again. Visualize the clusters and interpret the results.
The cluster visualization shows how the clusters are distributed across the different variables. Since we set 3 clusters, there are three colors in each plot, one per cluster. The line chart of the total within-cluster sum of squares shows that it is highest with a single cluster and drops sharply as clusters are added; a good number of clusters is where the drop levels off.
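A sketch of the k-means step: compute the total within-cluster sum of squares (WCSS) for a range of cluster counts, then re-run with the chosen k:

```r
boston_std <- scale(MASS::Boston)
set.seed(123)                                   # k-means starts are random

# total WCSS for 1..10 clusters
twcss <- sapply(1:10, function(k) kmeans(boston_std, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b",
     xlab = "number of clusters", ylab = "total WCSS")

km <- kmeans(boston_std, centers = 3)           # re-run with the chosen k
pairs(boston_std[, 1:5], col = km$cluster)      # clusters on the first variables
```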
1.Show a graphical overview of the data and show summaries of the variables in the data.
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The human data includes 155 observations and 8 variables: “Edu2.FM”, “Labo.FM”, “Edu.Exp”, “Life.Exp”, “GNI”, “Mat.Mor”, “Ado.Birth”, and “Parli.F”.
ggpairs shows the correlations between pairs of variables. I find that “Ado.Birth” and “Edu.Exp”, “Ado.Birth” and “Life.Exp”, “Mat.Mor” and “Edu.Exp”, and “Mat.Mor” and “Life.Exp” have strong negative correlations, while “Life.Exp” and “Edu.Exp”, and “Ado.Birth” and “Mat.Mor”, have strong positive correlations.
Corrplot gives a more compact visualization than ggpairs: it shows the correlation between each pair of variables with color. The redder a cell, the more negative the correlation; the bluer, the more positive. However, corrplot only shows the general relationship between two variables; it does not show the exact correlation values.
2.Perform principal component analysis (PCA) on the not standardized human data.
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Since the variables are not standardized, their standard deviations differ greatly in magnitude, so the first principal component is dominated by the variable with the largest variance and captures nearly all of the variance (99.99%). In the biplot, most of the countries cluster together, and the variable arrows are similarly hard to distinguish.
3.Standardize the variables in the human data and repeat the above analysis. Are the results different? Why or why not?
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
The results before and after standardization are different. Before standardization, the countries and the variable arrows all gather together, which makes the biplot hard to interpret; after standardization, the countries are more evenly distributed, and the variables have more similar standard deviations (the lengths of the arrows are almost the same).
4.Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.
After the human data is standardized, the countries are distributed more evenly. The arrows show the connections between the original features and the principal components (PC1, PC2), and the countries are placed on the x and y coordinates defined by the two PCs. The angle between arrows represents the correlation between the features: a small angle means a high positive correlation. We can see that “Parli.F” and “Labo.FM” point in a different direction from the other variables, so they are mainly related to one principal component while the remaining variables are mainly related to the other.
The lengths of the arrows are proportional to the standard deviations of the features; from the plot we can see that the variables have similar standard deviations.
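The standardized PCA and biplot can be sketched as follows (`human` is the wrangled data frame with the eight variables):

```r
human_std <- scale(human)                 # standardize the variables
pca_human <- prcomp(human_std)

summary(pca_human)                        # importance of components
biplot(pca_human, choices = 1:2,          # countries + variable arrows on PC1/PC2
       cex = c(0.6, 0.8), col = c("grey40", "deeppink2"))
```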
5.Look at the structure and the dimensions of the tea data and visualize it. Interpret the results of the MCA and draw at least the variable biplot of the analysis.
The tea dataset includes 300 observations and 6 variables, which are:

- “Tea”: Factor with 3 levels “black”, “Earl Grey”, “green”
- “How”: Factor with 4 levels “alone”, “lemon”, “milk”, “other”
- “how”: Factor with 3 levels “tea bag”, “tea bag+unpackaged”, “unpackaged”
- “sugar”: Factor with 2 levels “No.sugar”, “sugar”
- “where”: Factor with 3 levels “chain store”, “chain store+tea shop”, “tea shop”
- “lunch”: Factor with 2 levels “lunch”, “Not.lunch”
The summary shows the details of the data. The counts of each level are:

- Tea: black 74, Earl Grey 193, green 33
- How: alone 195, lemon 33, milk 63, other 9
- how: tea bag 170, tea bag+unpackaged 94, unpackaged 36
- sugar: No.sugar 155, sugar 145
- where: chain store 192, chain store+tea shop 78, tea shop 30
- lunch: lunch 44, Not.lunch 256
In the Tea variable, most of the data is “Earl Grey” (193); in How, “alone” (195); in how, “tea bag” (170); in sugar, “No.sugar” (155); in where, “chain store” (192); and in lunch, “Not.lunch” (256).
The visualization of the dataset presents the summary graphically, which makes it easier to interpret. Next, I perform multiple correspondence analysis (MCA). The summary shows the eigenvalues, individuals, categories, and categorical variables. From the eigenvalues we see that Dim.1 and Dim.2 retain a larger percentage of the variance than the other dimensions. From the v.test values of the categories, the coordinates of “black”, “Earl Grey”, “green”, “lemon”, “milk”, “tea bag”, “tea bag+unpackaged”, and “unpackaged” are significantly different from zero (the values lie below/above ±1.96). From the categorical variables we see that “how” and “where” have the strongest correlation with Dim.1.
The MCA biplot shows possible variable patterns. The distance between variable categories reflects their similarity; for example, “lemon” and “alone” are more similar than “lemon” and “other”.
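The MCA itself can be sketched with FactoMineR; `tea_time` is an assumed name for the data frame holding the six selected columns:

```r
library(FactoMineR)

mca <- MCA(tea_time, graph = FALSE)
summary(mca)                                 # eigenvalues, categories, v.tests

# variable biplot: categories only, coloured by variable
plot(mca, invisible = c("ind"), habillage = "quali")
```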